
Rntbd health check improvement 2 #33464

Merged
xinlian12 merged 14 commits into Azure:main on Feb 16, 2023

Conversation

xinlian12
Member

@xinlian12 xinlian12 commented Feb 13, 2023

Continued effort to improve the Rntbd health check flow, especially for timeout detection.

Why the changes are needed
Based on a few recent latency investigations, several patterns have been identified that demand more aggressive connection closure.

  • ExpressRoute: there are identified cases where, when an ExpressRoute VM is redeployed (restarted, etc.), the existing connections are not closed, leaving every connection to that ExpressRoute VM in a broken state. Establishing new connections is the only way to recover.
  • Write workloads: a write operation is only sent to the primary replica, so even during a retry there is a higher chance that the request will be retried on the same connection.
  • Requests that time out on a connection that has been idle for a while.

Changes included in this PR:

Introduced a few more internal parameters to control how quickly a channel is closed when a transit timeout is detected:

timeoutDetectionEnabled: Default true
timeoutDetectionDisableCPUThreshold: Default 90.0
timeoutDetectionTimeLimit: Default 60s
timeoutDetectionHighFrequencyThreshold: Default 3
timeoutDetectionHighFrequencyTimeLimit: Default 10s
timeoutDetectionOnWriteThreshold: Default 1
timeoutDetectionOnWriteTimeLimit: Default 6s

A few timeout scenarios will trigger a connection to be closed (a simplified decision sketch follows the list):

  • timeoutDetectionTimeLimit: a timeout has been observed within this time limit; it does not matter how many timeouts have been observed. This helps detect a broken connection under a sparse workload.
  • timeoutDetectionHighFrequencyThreshold + timeoutDetectionHighFrequencyTimeLimit: timeouts have happened very frequently; in this case we want to close the channel sooner.
  • timeoutDetectionOnWriteThreshold + timeoutDetectionOnWriteTimeLimit: timeouts happened on write-related operations. Since only the primary replica is used for writes, we want to close the channel more aggressively as well.
  • timeoutDetectionDisableCPUThreshold: high CPU can cause a large number of request timeouts; when this happens, closing existing channels and re-establishing new ones will not help the situation but rather make it worse. When the CPU threshold is hit, timeout detection is disabled, and it automatically resumes once CPU usage drops back below the configured threshold.
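
The actual checks live in the SDK-internal Rntbd channel health checker; the following is a hypothetical, simplified sketch of how these thresholds could combine. All names here (shouldCloseChannel, firstTimeoutSinceLastSuccessfulRead, etc.) are illustrative, not SDK API:

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical sketch only; the real implementation may differ in detail.
    final class TimeoutDetectionSketch {
        // Defaults from this PR.
        static final Duration TIME_LIMIT = Duration.ofSeconds(60);                // timeoutDetectionTimeLimit
        static final int HIGH_FREQUENCY_THRESHOLD = 3;                            // timeoutDetectionHighFrequencyThreshold
        static final Duration HIGH_FREQUENCY_TIME_LIMIT = Duration.ofSeconds(10); // timeoutDetectionHighFrequencyTimeLimit
        static final int ON_WRITE_THRESHOLD = 1;                                  // timeoutDetectionOnWriteThreshold
        static final Duration ON_WRITE_TIME_LIMIT = Duration.ofSeconds(6);        // timeoutDetectionOnWriteTimeLimit
        static final double DISABLE_CPU_THRESHOLD = 90.0;                         // timeoutDetectionDisableCPUThreshold

        static boolean shouldCloseChannel(
            Instant now,
            Instant firstTimeoutSinceLastSuccessfulRead, // reset whenever a read succeeds
            int timeoutCount,                            // transit timeouts since the last successful read
            int timeoutOnWriteCount,                     // of those, how many were write operations
            double cpuUsagePercent) {

            // High CPU: the timeouts are likely CPU-induced, so closing channels would
            // only make things worse; detection is suspended until CPU drops again.
            if (cpuUsagePercent >= DISABLE_CPU_THRESHOLD || timeoutCount == 0) {
                return false;
            }

            Duration elapsed = Duration.between(firstTimeoutSinceLastSuccessfulRead, now);

            // 1. Sparse workload: any timeout persisting past the overall time limit.
            if (elapsed.compareTo(TIME_LIMIT) >= 0) {
                return true;
            }
            // 2. High-frequency timeouts within a shorter window.
            if (timeoutCount >= HIGH_FREQUENCY_THRESHOLD
                && elapsed.compareTo(HIGH_FREQUENCY_TIME_LIMIT) >= 0) {
                return true;
            }
            // 3. Write timeouts: writes only go to the primary replica, so close sooner.
            return timeoutOnWriteCount >= ON_WRITE_THRESHOLD
                && elapsed.compareTo(ON_WRITE_TIME_LIMIT) >= 0;
        }
    }
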
The above configs can be modified through system properties:
            System.setProperty("COSMOS.TCP_HEALTH_CHECK_TIMEOUT_DETECTION_ENABLED", "false");
 System.setProperty(
                "azure.cosmos.directTcp.defaultOptions",
                "{"\"timeoutDetectionTimeLimit\":\"PT61S\", \"timeoutDetectionHighFrequencyThreshold\":\"4\", " +
                    "\"timeoutDetectionHighFrequencyTimeLimit\":\"PT11S\", \"timeoutDetectionOnWriteThreshold\":\"2\"," +
                    "\"timeoutDetectionOnWriteTimeLimit\":\"PT7S\"}");

NetworkRequestTimeout: changing the minimum allowed value from 5s to 1s.
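
For context, the network request timeout is configured through the public DirectConnectionConfig API; a minimal sketch of opting into the new lower bound (endpoint and key are placeholders):

    import com.azure.cosmos.CosmosClient;
    import com.azure.cosmos.CosmosClientBuilder;
    import com.azure.cosmos.DirectConnectionConfig;
    import java.time.Duration;

    // Values below 5 seconds were previously rejected; this PR lowers the floor to 1 second.
    DirectConnectionConfig directConfig = DirectConnectionConfig.getDefaultConfig()
        .setNetworkRequestTimeout(Duration.ofSeconds(1));

    CosmosClient client = new CosmosClientBuilder()
        .endpoint("<account-endpoint>") // placeholder
        .key("<account-key>")           // placeholder
        .directMode(directConfig)
        .buildClient();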

Adding channel statistics in CosmosDiagnostics:

  • New connection: [screenshot of channel statistics in the diagnostics output]
  • Reusing an existing connection: [screenshot]
  • Timeout happened on the channel: [screenshot]
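
The statistics ride along in the existing diagnostics output; a minimal sketch of surfacing them, assuming an existing CosmosContainer and a hypothetical MyItem POJO:

    import com.azure.cosmos.CosmosContainer;
    import com.azure.cosmos.models.CosmosItemResponse;
    import com.azure.cosmos.models.PartitionKey;

    CosmosItemResponse<MyItem> response = container.readItem(
        "item-id", new PartitionKey("pk-value"), MyItem.class);

    // CosmosDiagnostics#toString() contains the transport-level details, including
    // the per-channel statistics added by this PR.
    System.out.println(response.getDiagnostics());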

@azure-sdk
Collaborator

API change check

APIView has identified API level changes in this PR and created following API reviews.

azure-cosmos

@xinlian12
Member Author

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@FabianMeiswinkel FabianMeiswinkel left a comment


Thanks Annie - Kudos! Very clear design and implementation.

@xinlian12
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member Author

Failed tests:
openConnectionsAndInitCachesWithInvalidCosmosClientConfig - not caused by the change in this PR; will fix in a separate PR
before_ParallelDocumentQueryTest - tested locally and succeeded

@xinlian12
Member Author

/check-enforcer override

@xinlian12 xinlian12 merged commit cd4e903 into Azure:main Feb 16, 2023